Abstract

We present a novel hierarchical distance-dependent Bayesian model for
event coreference resolution. While existing generative models for event
coreference resolution are completely unsupervised, our model allows for
the incorporation of pairwise distances between event mentions
(information that is widely used in supervised coreference models) to
guide the generative clustering process toward better event clustering
both within and across documents. We model the distances between event
mentions using a feature-rich learnable distance function and encode them
as Bayesian priors for nonparametric clustering. Experiments on the ECB+
corpus show that our model outperforms state-of-the-art methods for both
within- and cross-document event coreference resolution.


1 Introduction 


The task of event coreference resolution consists of identifying text
snippets that describe events, and then clustering them such that all
event mentions in the same partition refer to the same unique event.
Event coreference resolution can be applied within a single document or
across multiple documents and is crucial for many natural language
processing tasks including topic detection and tracking, information
extraction, question answering and textual entailment (Bejan and
Harabagiu, 2010). More importantly, event coreference resolution is a
necessary component in any reasonable, broadly applicable computational
model of natural language understanding (Humphreys et al., 1997).

In comparison to entity coreference resolution (Ng, 2010), which deals
with identifying and grouping noun phrases that refer to the same
discourse entity, event coreference resolution has not been extensively
studied. This is, in part, because events typically exhibit a more
complex structure than entities: a single event can be described via
multiple event mentions, and a single event mention can be associated
with multiple event arguments that characterize the participants in the
event as well as spatio-temporal information (Bejan and Harabagiu, 2010).
Hence, the coreference decisions for event mentions usually require the
interpretation of event mentions and their arguments in context. See, for
example, Figure 1, in which five event mentions across two documents all
refer to the same underlying event: Plane bombs Yida camp.


Event: Plane bombs Yida camp

Document 1: The {Yida refugee camp} was the target of an air strike
{in South Sudan} {on Thursday}. {Four bombs} were dropped within just a
few moments - {two} {inside the camp itself}, while {the other two}
{near the airstrip}.

Document 2: The {Yida refugee camp} {in South Sudan} was bombed
{on Thursday}. {At least four bombs} were reportedly dropped.
{Two bombs} {within the Yida camp}, including {one} {close to the
school}.

Figure 1: Examples of event coreference. Mutually coreferent event
mentions are underlined and in boldface; participant and spatio-temporal
information for the highlighted event is marked by curly brackets.


Most previous approaches to event coreference resolution (e.g., Ahn
(2006), Chen et al. (2009)) operated by extending the supervised pairwise
classification model that is widely used in entity coreference resolution
(e.g., Ng and Cardie (2002)). In this framework, pairwise distances
between event mentions are modeled via event-related features (e.g.,
features that indicate event argument compatibility), and agglomerative
clustering is applied to greedily merge event mentions into clusters. A
major drawback of this general approach is that it makes hard decisions
on the merging and splitting of clusters based on heuristics derived from
the pairwise distances. In addition, it only captures pairwise
coreference decisions within a single document and cannot account for
signals that commonly appear across documents. More recently, Bejan and
Harabagiu (2010; 2014) proposed several nonparametric Bayesian models for
event coreference resolution that probabilistically infer event clusters
both within a document and across multiple documents. Their method,
however, is completely unsupervised, and thus cannot encode any readily
available supervisory information to guide the model toward better event
clustering.

To address these limitations, we propose a novel Bayesian model for
within- and cross-document event coreference resolution. It leverages
supervised feature-rich modeling of pairwise coreference relations and
generative modeling of cluster distributions, and thus allows for both
probabilistic inference over event clusters and easy incorporation of
pairwise linking preferences. Our model builds on the framework of the
distance-dependent Chinese restaurant process (DDCRP) (Blei and Frazier,
2011), which was introduced to incorporate data dependencies into
nonparametric clustering models. Here, however, we extend the DDCRP to
allow the incorporation of feature-based, learnable distance functions as
clustering priors, thus encouraging event mentions that are close in
meaning to belong to the same cluster. In addition, we introduce to the
DDCRP a representational hierarchy that allows event mentions to be
grouped within a document and within-document event clusters to be
grouped across documents.

To investigate the effectiveness of our approach, we conduct extensive
experiments on the ECB+ corpus (Cybulska and Vossen, 2014b), an extension
to EventCorefBank (ECB) (Bejan and Harabagiu, 2010) and the largest
corpus available that contains event coreference annotations within and
across documents. We show that integrating pairwise learning of event
coreference relations with unsupervised hierarchical modeling of event
clustering achieves promising improvements over state-of-the-art
approaches for within- and cross-document event coreference resolution.

2 Related Work

Coreference resolution in general is a difficult natural language
processing (NLP) task and typically requires sophisticated
inferentially-based knowledge-intensive models (Kehler, 2002). Extensive
work in the literature focuses on the problem of entity coreference
resolution and many techniques have been developed, including rule-based
deterministic models (e.g. Cardie and Wagstaff (1999), Raghunathan et al.
(2010), Lee et al. (2011)) that traverse over mentions in certain
orderings and make deterministic coreference decisions based on all
available information at the time; supervised learning-based models (e.g.
Stoyanov et al. (2009), Rahman and Ng (2011), Durrett and Klein (2013))
that make use of rich linguistic features and the annotated corpora to
learn more powerful coreference functions; and finally, unsupervised
models (e.g. Bhattacharya and Getoor (2006), Haghighi and Klein (2007,
2010)) that successfully apply generative modeling to the coreference
resolution problem.

Event coreference resolution is a more complex task than entity
coreference resolution (Humphreys et al., 1997) and also has been
relatively less studied. Existing work has adapted similar ideas to those
used in entity coreference. Humphreys et al. (1997) first proposed a
deterministic clustering mechanism to group event mentions of
pre-specified types based on hard constraints. Later approaches (Ahn,
2006; Chen et al., 2009) applied learning-based pairwise classification
decisions using event-specific features to infer event clustering. Bejan
and Harabagiu (2010; 2014) proposed several unsupervised generative
models for event mention clustering based on the hierarchical Dirichlet
process (HDP) (Teh et al., 2006). Our approach is related to both
supervised clustering and generative clustering approaches. It is a
nonparametric Bayesian model in nature but encodes rich linguistic
features in clustering priors. More recent work modeled both entity and
event information in event coreference. Lee et al. (2012) showed that
iteratively merging entity and event clusters can boost the clustering
performance. Liu et al. (2014) demonstrated the benefits of propagating
information between event arguments and event mentions during a
post-processing step. Other work modeled event coreference as a predicate
argument alignment problem between pairs of sentences, and trained
classifiers for making alignment decisions (Roth and Frank, 2012; Wolfe
et al., 2015). Our model also brings event argument information into the
decisions of event coreference but incorporates it into Bayesian
clustering priors.


Most existing coreference models, both for events and entities, focus on
solving the within-document coreference problem. Cross-document
coreference has attracted less attention due to lack of annotated corpora
and the requirement for larger model capacity. Hierarchical models (Singh
et al., 2010; Wick et al., 2012; Haghighi and Klein, 2007) have been
popular choices for cross-document coreference as they can capture
coreference at multiple levels of granularity. Our model is also
hierarchical, capturing both within- and cross-document coreference.


Our model is also closely related to the distance-dependent Chinese
Restaurant Process (DDCRP) (Blei and Frazier, 2011). The DDCRP is an
infinite clustering model that can account for data dependencies (Ghosh
et al., 2011; Socher et al., 2011). But it is a flat clustering model and
thus cannot capture hierarchical structure that usually exists in large
data collections. Very little work has explored the use of DDCRP in
hierarchical clustering models. Kim and Oh (2011) and Ghosh et al. (2011)
combined a DDCRP with a standard CRP in a two-level hierarchy analogous
to the HDP with restricted distance functions. Ghosh et al. (2014)
proposed a two-level DDCRP with data-dependent distance-based priors at
both levels. Our model is also a two-level DDCRP model but differs in
that its distance function is learned using a feature-rich log-linear
model. We also derive an effective Gibbs sampler for posterior inference.


Action        bombs
Participant   Sudan, Yida refugee camp
Time          Thursday, Nov 10, 2011
Location      South Sudan

Table 1: Mentions of event components


3 Problem Formulation 


We adopt the terminology from ECB+ (Cybulska and Vossen, 2014b), a corpus
that extends the widely used EventCorefBank (ECB) (Bejan and Harabagiu,
2010). An event is something that happens or a situation that occurs
(Cybulska and Vossen, 2014a). It consists of four components: (1) an
Action: what happens in the event; (2) Participants: who or what is
involved; (3) a Time: when the event happens; and (4) a Location: where
the event happens. We assume that each document in the corpus consists of
a set of mentions (text spans) that describe event actions, their
participants, times, and locations. Table 1 shows examples of these in
the sentence "Sudan bombs Yida refugee camp in South Sudan on Thursday,
Nov 10th, 2011."

In this paper, we also use the term event mention to refer to the mention
of an event action, and event arguments to refer collectively to mentions
of the participants, times and locations involved in the event. Event
mentions are usually noun phrases or verb phrases that clearly describe
events. Two event mentions are considered coreferent if they refer to the
same actual event, i.e. a situation involving a particular combination of
action, participants, time and location. Note that in text, not all event
arguments are always present for an event mention; they may even be
distributed over different sentences. Thus whether two event mentions are
coreferential should be determined based on the context. For example, in
Figure 1, the event mention dropped in Document 1 corefers with air
strike in the same document as they describe the same event, Plane bombs
Yida camp, in the discourse context; it also corefers with dropped in
Document 2 based on the contexts of both documents.


The problem of event coreference resolution can be divided into two
sub-problems: (1) event extraction: extracting event mentions and event
arguments, and (2) event clustering: grouping event mentions into
clusters according to their coreference relations. We consider both
within- and cross-document event coreference resolution and hypothesize
that leveraging context information from multiple documents will improve
both within- and cross-document coreference resolution. In the following,
we first describe the event extraction step and then focus on the event
clustering step.


4 Event Extraction 


The goal of event extraction is to extract from a text all event mentions
(actions) and event arguments (the associated participants, times and
locations). One might expect that event actions could be extracted
reasonably well by identifying verb groups; and event arguments, by
applying semantic role labeling (SRL) to identify, for example, the Agent
and Patient of each predicate. Unfortunately, most SRL systems only
handle verbal predicates and so would miss event mentions described via
noun phrases. In addition, SRL systems are not designed to capture
event-specific arguments. Accordingly, we found that a state-of-the-art
SRL system (SwiRL (Surdeanu et al., 2007)) extracted only 56% of the
actions, 76% of participants, 65% of times and 13% of locations for
events in a development set of ECB+ based on a head word matching
evaluation measure. (We provide dataset details in Section 6.)

To produce higher recall, we adopt a supervised approach and train an
event extractor using sentences from ECB+, which are annotated for event
actions, participants, times and locations. Because these mentions vary
widely in their length and grammatical type, we employ semi-Markov CRFs
(Sarawagi and Cohen, 2004) using the loss-augmented objective of Yang and
Cardie (2014) that provides more accurate detection of mention
boundaries. We make use of a rich feature set that includes word-level
features such as unigrams, bigrams, POS tags, WordNet hypernyms, synonyms
and FrameNet semantic roles, and phrase-level features such as phrasal
syntax (e.g., NP, VP) and phrasal embeddings (constructed by averaging
word embeddings produced by word2vec (Mikolov et al., 2013)). Our
experiments on the same (held-out) development data show that the
semi-CRF-based extractor correctly identifies 95% of actions, 90% of
participants, 94% of times and 74% of locations, again based on head word
matching.

Note that the semi-CRF extractor identifies event mentions and event
arguments but not relationships among them, i.e. it does not associate
arguments with an event mention. Lacking supervisory data in the ECB+
corpus for training an event action-argument relation detector, we assume
that all event arguments identified by the semi-CRF extractor are related
to all event mentions in the same sentence and then apply SRL-based
heuristics to augment and further disambiguate intra-sentential
action-argument relations (using the SwiRL SRL). More specifically, we
link each verbal event mention to the participants that match its ARG0,
ARG1 or ARG2 semantic role fillers; similarly, we associate with the
event mention the time and locations that match its AM-TMP and AM-LOC
role fillers, respectively. For each nominal event mention, we associate
those participants that match the possessor of the mention since these
were suggested in Lee et al. (2012) as playing the ARG0 role for nominal
predicates.


5 Event Clustering 

Now we describe our proposed Bayesian model for event clustering. Our
model is a hierarchical extension of the distance-dependent Chinese
Restaurant Process (DDCRP). It first groups event mentions within a
document to form within-document event clusters and then groups these
event clusters across documents to form global clusters. The model can
account for the similarity between event mentions during the clustering
process, putting a bias toward clusters composed of event mentions that
are similar to each other based on the context. To capture event
similarity, we use a log-linear model with rich syntactic and semantic
features, and learn the feature weights using gold-standard data.

5.1 Distance-dependent Chinese Restaurant 
Process 

The Distance-dependent Chinese Restaurant Process (DDCRP) is a
generalization of the Chinese Restaurant Process (CRP) that models
distributions over partitions. In a CRP, the generative process can be
described by imagining data points as customers in a restaurant and the
partitioning of data as tables at which the customers sit. The process
randomly samples the table assignment for each customer sequentially: the
probability of a customer sitting at an existing table is proportional to
the number of customers already sitting at that table and the probability
of sitting at a new table is proportional to a scaling parameter. For
each customer sitting at the same table, an observation can be drawn from
a distribution determined by the parameter associated with that table.
Despite the sequential sampling process, the CRP makes the assumption of
exchangeability: the permutation of the customer ordering does not change
the probability of the partitions.
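The sequential seating scheme described above can be illustrated with a short sketch. It is only an illustration of the CRP prior over partitions, not the model's inference code; the function name `sample_crp` and the use of Python's `random` module are assumptions made for this example.

```python
import random

def sample_crp(n, alpha, rng):
    """Seat n customers one by one; return the table index of each customer.

    An existing table is chosen with weight equal to its current size,
    and a new table with weight alpha (the scaling parameter, alpha > 0).
    """
    table_sizes = []   # table_sizes[t] = number of customers at table t
    assignment = []
    for _ in range(n):
        weights = table_sizes + [alpha]
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(table_sizes):
            table_sizes.append(1)      # customer opens a new table
        else:
            table_sizes[t] += 1        # customer joins an existing table
        assignment.append(t)
    return assignment

seating = sample_crp(10, alpha=1.0, rng=random.Random(0))
```

Larger values of `alpha` make new tables more likely, so the expected number of clusters grows with it.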

The exchangeability assumption may not be reasonable for clustering data
that has clear inter-dependencies. The DDCRP allows the incorporation of
data dependencies in infinite clustering, encouraging data points that
are closer to each other to be grouped together. In the generative
process, instead of directly sampling a table assignment for each
customer, it samples a customer link, linking the customer to another
customer or itself. The clustering can be uniquely constructed once the
customer links are determined for all customers: two customers belong to
the same cluster if and only if one can reach the other by traversing the
customer links (treating these links as undirected).

More formally, consider a sequence of customers $1, \ldots, n$, and
denote $\mathbf{a} = (a_1, \ldots, a_n)$ as the assignments of the
customer links. $a_i \in \{1, \ldots, n\}$ is drawn from

\[
p(a_i = j \mid F, \alpha) \propto
\begin{cases}
F(i, j), & j \neq i \\
\alpha, & j = i
\end{cases}
\tag{1}
\]

where $F$ is a distance function and $F(i, j)$ is a value that measures
the distance between customers $i$ and $j$. $\alpha$ is a scaling
parameter, measuring self-affinity. For each customer, the observation is
generated by the per-table parameters as in the CRP. A DDCRP is said to
be sequential if $F(i, j) = 0$ when $i < j$, so customers may link only
to themselves, and to previous customers.
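A minimal sketch of Equation 1 follows: it draws one link per customer and then recovers the clustering as the connected components of the link graph. The weight function `seq_decay` is a hypothetical stand-in for $F$ chosen only so the example is sequential; it is not the learned function used in this paper, and the helper names are likewise assumptions.

```python
import random

def sample_ddcrp_links(n, F, alpha, rng):
    """Draw a_i for each customer i: weight F(i, j) for j != i, alpha for j == i."""
    links = []
    for i in range(n):
        weights = [alpha if j == i else F(i, j) for j in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])
    return links

def clusters_from_links(links):
    """Connected components of the link graph, treating links as undirected."""
    parent = list(range(len(links)))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(links)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def seq_decay(i, j):
    """Hypothetical sequential weight: zero for later customers, decaying with gap."""
    return 0.0 if i < j else 1.0 / (1 + i - j)

links = sample_ddcrp_links(6, seq_decay, alpha=1.0, rng=random.Random(1))
clusters = clusters_from_links(links)
```

Because `seq_decay` assigns zero weight whenever `i < j`, every sampled link points to the customer itself or to an earlier customer, matching the sequential case.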


5.2 A Hierarchical Extension of the DDCRP 

We can model within-document coreference resolution using a sequential
DDCRP. Imagining customers as event mentions and the restaurant as a
document, each mention can either refer to an antecedent mention in the
document or no other mentions, starting the description of a new event.
However, the coreference relations may also exist across documents: the
same event may be described in multiple documents. Thus it is ideal to
have a two-level clustering model that can group event mentions within a
document and further group them across documents. Therefore we propose a
hierarchical extension of the DDCRP (HDDCRP) that employs a DDCRP twice:
the first-level DDCRP links mentions based on within-document distances
and the second-level DDCRP links the within-document clusters based on
cross-document distances, forming larger clusters in the corpus.

The generative process of an HDDCRP can be described using the same
"Chinese Restaurant" metaphor. Imagine a collection of documents as a
collection of restaurants, and the event mentions in each document as
customers entering a restaurant. The local (within-document) event
clusters correspond to tables. The global (within-corpus) event clusters
correspond to menus (tables that serve the same menu belong to the same
cluster). The hidden variables are the customer links and the table
links. Figure 2 shows a configuration of these variables and the
corresponding clustering structure.

Figure 2: A cluster configuration generated by the HDDCRP. Each
restaurant is represented by a rectangle. The small green circles
represent customers. The ovals represent tables and the colors reflect
the clustering. Each customer is assigned a customer link (a solid
arrow), linking to itself or another customer in the same restaurant. The
customer who first sits at the table is assigned a table link (a dashed
arrow), linking to itself or another customer in a different restaurant,
resulting in the linking of two tables.

More formally, the generative process for the HDDCRP can be described as
follows:

1. For each restaurant $d \in \{1, \ldots, D\}$, for each customer
$i \in \{1, \ldots, n_d\}$, sample a customer link using a sequential
DDCRP:

\[
p(a_{i,d} = (j, d)) \propto
\begin{cases}
F_d(i, j), & j < i \\
\alpha_d, & j = i \\
0, & j > i
\end{cases}
\tag{2}
\]

2. For each restaurant $d \in \{1, \ldots, D\}$, for each table $t$,
sample a table link for the customer $(i, d)$ who first sits at $t$ using
a DDCRP:

\[
p(c_{i,d} = (j, d')) \propto
\begin{cases}
F_0((i, d), (j, d')), & j \in \{1, \ldots, n_{d'}\},\ d' \neq d \\
\alpha_0, & j = i,\ d' = d
\end{cases}
\tag{3}
\]

3. Calculate clusters $\mathbf{z}(\mathbf{a}, \mathbf{c})$ by traversing
all the customer links $\mathbf{a}$ and the table links $\mathbf{c}$. Two
customers are in the same cluster if and only if there is a path from one
to the other along the links, where we treat both table and customer
links as undirected.

4. For each cluster $k \in \mathbf{z}(\mathbf{a}, \mathbf{c})$, sample
parameters $\phi_k \sim G_0(\lambda)$.

5. For each customer $i$ in cluster $k$, sample an observation
$x_i \sim p(\cdot \mid \phi_{z_i})$ where $z_i = k$.

$F_{1:D}$ and $F_0$ are distance functions that map a pair of customers
to a distance value. We will discuss them in detail in Section 5.4.

5.3 Posterior Inference with Gibbs Sampling

The central computation problem for the HDDCRP model is posterior
inference: computing the conditional distribution of the hidden variables
given the observations,
$p(\mathbf{a}, \mathbf{c} \mid \mathbf{x}, \alpha_0, F_0, \alpha_{1:D}, F_{1:D})$.
The posterior is intractable due to a combinatorial number of possible
link configurations. Thus we approximate the posterior using Markov Chain
Monte Carlo (MCMC) sampling, and specifically using a Gibbs sampler.

In developing this Gibbs sampler, we first observe that the generative
process is equivalent to one that, in step 2, samples a table link for
all customers, and then in step 3, when calculating
$\mathbf{z}(\mathbf{a}, \mathbf{c})$, includes only those table links
$c_{i,d}$ originating at customers $(i, d)$ that started a new table,
i.e. that chose $a_{i,d} = (i, d)$.

The Gibbs sampler for the HDDCRP iteratively samples a customer link for
each customer $(i, d)$ from

\[
p(a^*_{i,d} \mid \mathbf{a}_{-(i,d)}, \mathbf{c}, \mathbf{x}, \lambda)
\propto p(a^*_{i,d}) \, \Gamma_a(\mathbf{x}, \mathbf{z}, \lambda)
\tag{4}
\]

where

\[
\Gamma_a(\mathbf{x}, \mathbf{z}, \lambda) =
\frac{p(\mathbf{x} \mid \mathbf{z}(\mathbf{a}_{-(i,d)} \cup a^*_{i,d}, \mathbf{c}), \lambda)}
     {p(\mathbf{x} \mid \mathbf{z}(\mathbf{a}_{-(i,d)}, \mathbf{c}), \lambda)}
\]

After sampling all the customer links, it samples a table link for all
customers $(i, d)$ according to

\[
p(c^*_{i,d} \mid \mathbf{a}, \mathbf{c}_{-(i,d)}, \mathbf{x}, \lambda)
\propto p(c^*_{i,d}) \, \Gamma_c(\mathbf{x}, \mathbf{z}, \lambda)
\tag{5}
\]

where

\[
\Gamma_c(\mathbf{x}, \mathbf{z}, \lambda) =
\frac{p(\mathbf{x} \mid \mathbf{z}(\mathbf{a}, \mathbf{c}_{-(i,d)} \cup c^*_{i,d}), \lambda)}
     {p(\mathbf{x} \mid \mathbf{z}(\mathbf{a}, \mathbf{c}_{-(i,d)}), \lambda)}
\]

For those customers $(i, d)$ that did not start a new table, i.e. with
$a_{i,d} \neq (i, d)$, the table link $c^*_{i,d}$ does not affect the
clustering, and so $\Gamma_c(\mathbf{x}, \mathbf{z}, \lambda) = 1$ in
this case.

Referring back to the event coreference example in Figure 1, Figure 3
shows an example of variable configuration for the HDDCRP model and the
corresponding coreference clusters.

a1=1  a2=2  a3=3  a4=4  a5=4
c1=3  c2=2  c3=2  c4=2  c5=5 [ina]

Figure 3: An example of event clustering and the corresponding variable
assignments. The assignments of $\mathbf{a}$ induce tables, or
within-document (WD) clusters, and the assignments of $\mathbf{c}$ induce
menus, or cross-document (CD) clusters. [ina] denotes that the variable
is inactive and will not affect the clustering.
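The way customer links and table links jointly determine the global clusters can be sketched as follows, using the assignments listed for Figure 3. The sketch assumes a single global, 0-based index per mention (so the figure's values are shifted down by one) and a hypothetical helper name `hddcrp_clusters`; a table link is followed only when its customer started a table.

```python
def hddcrp_clusters(a, c):
    """Global clusters from customer links a and table links c.

    c[i] is active only if customer i links to itself (a[i] == i),
    i.e. customer i is the first customer at its table.
    """
    n = len(a)
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    for i in range(n):
        union(i, a[i])          # customer link (within document)
        if a[i] == i:
            union(i, c[i])      # active table link (across documents)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Assignments from Figure 3, shifted to 0-based indices.
a = [0, 1, 2, 3, 3]
c = [2, 1, 1, 1, 4]

tables = hddcrp_clusters(a, list(range(5)))   # all table links are self-links
menus = hddcrp_clusters(a, c)
```

With self-linking table links the result is the within-document tables only; with the table links of the figure, all five mentions end up in one global cluster, and changing the inactive link of the fifth mention leaves the result unchanged.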

                              Train    Dev    Test    Total
# Documents                     462     73     447      982
# Sentences                   7,294    649   7,867   15,810
# Annotated event mentions    3,555    441   3,290    7,286
# Cross-document chains         687     47     486    1,220
# Within-document chains      2,499    316   2,137    4,952

Table 2: Statistics of the ECB+ corpus

In implementation, we can simplify the computations of both
$\Gamma_a(\mathbf{x}, \mathbf{z}, \lambda)$ and
$\Gamma_c(\mathbf{x}, \mathbf{z}, \lambda)$ by using the fact that the
likelihood under clustering $\mathbf{z}(\mathbf{a}, \mathbf{c})$ can be
factorized as

\[
p(\mathbf{x} \mid \mathbf{z}(\mathbf{a}, \mathbf{c}), \lambda) =
\prod_{k \in \mathbf{z}(\mathbf{a}, \mathbf{c})} p(\mathbf{x}_{\mathbf{z}=k} \mid \lambda)
\]

where $\mathbf{x}_{\mathbf{z}=k}$ denotes all customers that belong to
the global cluster $k$. $p(\mathbf{x}_{\mathbf{z}=k} \mid \lambda)$ is
the marginal probability. It can be computed as

\[
p(\mathbf{x}_{\mathbf{z}=k} \mid \lambda) =
\int p(\phi \mid \lambda) \prod_{i \in \mathbf{z}=k} p(x_i \mid \phi) \, d\phi
\]

where $x_i$ is the observation associated with customer $i$. In our
problem, the observation corresponds to the lemmatized words in the event
mention. We model the observed word counts using cluster-specific
multinomial distributions with symmetric Dirichlet priors.
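For a multinomial likelihood with a symmetric Dirichlet prior, the integral above has a closed form in terms of Gamma functions. The sketch below is one way to compute it and the resulting likelihood ratio for merging two clusters; it assumes the words of a cluster are treated as an ordered sequence (so no multinomial coefficient appears), and the function names, `vocab_size`, and `lam` are illustrative choices rather than the authors' implementation.

```python
import math
from collections import Counter

def log_marginal(words, vocab_size, lam):
    """log p(x_{z=k} | lambda): Dirichlet-multinomial marginal of a word list."""
    counts = Counter(words)
    n = sum(counts.values())
    out = math.lgamma(vocab_size * lam) - math.lgamma(vocab_size * lam + n)
    for c in counts.values():
        out += math.lgamma(lam + c) - math.lgamma(lam)
    return out

def log_merge_ratio(words_a, words_b, vocab_size, lam):
    """Log likelihood ratio of merging two clusters versus keeping them apart.

    Because the likelihood factorizes over clusters, the terms of all
    other clusters cancel in the ratio.
    """
    merged = log_marginal(words_a + words_b, vocab_size, lam)
    return (merged
            - log_marginal(words_a, vocab_size, lam)
            - log_marginal(words_b, vocab_size, lam))

same = log_merge_ratio(["bomb"], ["bomb"], vocab_size=2, lam=1.0)
diff = log_merge_ratio(["bomb"], ["strike"], vocab_size=2, lam=1.0)
```

In this toy setting, merging two clusters that share a word increases the likelihood (positive log ratio), while merging clusters with different words decreases it.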

5.4 Feature-based Distance Functions 

The distance functions $F_{1:D}$ and $F_0$ encode the priors for the
clustering distribution, preferring clustering data points that are
closer to each other. We consider event mentions as the data points and
encode the similarity (or compatibility) between event mentions as priors
for event clustering. Specifically, we use a log-linear model to estimate
the similarity between a pair of event mentions $(x_i, x_j)$

\[
f_\theta(x_i, x_j) \propto \exp\{\theta^T \psi(x_i, x_j)\}
\tag{6}
\]

where $\psi$ is a feature vector, containing a rich set of features based
on event mentions $i$ and $j$: (1) head word string match, (2) head POS
pair, (3) cosine similarity between the head word embeddings (we use the
pre-trained 300-dimensional word embeddings from word2vec¹), (4)
similarity between the words in the event mentions (based on term
frequency (TF) vectors), (5) the Jaccard coefficient between the WordNet
synonyms of the head words, and (6) similarity between the context words
(a window of three words before and after each event mention). If both
event mentions involve participants, we consider the similarity between
the words in the participant mentions based on the TF vectors, similarly
for the time mentions and the location mentions. If the SRL role
information is available, we also consider the similarity between words
in each SRL role, i.e. Arg0, Arg1, Arg2.
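As an illustration of Equation 6, the sketch below builds a small feature vector from a subset of the features listed above and turns it into an unnormalized similarity score. The mention representation (a dict with the keys `head`, `words`, `synonyms`, `context`) and the function names are assumptions made for the example; the embedding, POS, argument and SRL features are omitted.

```python
import math
from collections import Counter

def cosine_tf(words_a, words_b):
    """Cosine similarity between term-frequency vectors of two word lists."""
    ta, tb = Counter(words_a), Counter(words_b)
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0

def jaccard(set_a, set_b):
    """Jaccard coefficient between two sets (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def features(m_i, m_j):
    """A reduced psi(x_i, x_j): head match, word TF, synonym Jaccard, context TF."""
    return [
        1.0 if m_i["head"] == m_j["head"] else 0.0,
        cosine_tf(m_i["words"], m_j["words"]),
        jaccard(m_i["synonyms"], m_j["synonyms"]),
        cosine_tf(m_i["context"], m_j["context"]),
    ]

def similarity(theta, m_i, m_j):
    """Unnormalized f_theta(x_i, x_j) = exp(theta^T psi(x_i, x_j))."""
    return math.exp(sum(t * f for t, f in zip(theta, features(m_i, m_j))))

m1 = {"head": "strike", "words": ["air", "strike"],
      "synonyms": {"strike", "attack"}, "context": ["target", "of", "an"]}
m2 = {"head": "bomb", "words": ["bomb"],
      "synonyms": {"bomb", "attack"}, "context": ["was", "on", "thursday"]}
score = similarity([1.0, 1.0, 1.0, 1.0], m1, m2)
```

In practice the weight vector would come from the training procedure described next rather than being set by hand.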

Training We train the parameter $\theta$ using logistic regression with
an L2 regularizer. We construct the training data by considering all
ordered pairs of event mentions within a document, and also all pairs of
event mentions across similar documents. To measure document similarity,
we collect all mentions of events, participants, times and locations in
each document and compute the cosine similarity between the TF vectors
constructed from all the event-related mentions. We consider two
documents to be similar if their TF-based similarity is above a threshold
(we set it to 0.4 in our experiments).

¹ https://code.google.com/p/word2vec/

After learning $\theta$, we set the within-document distances as
$F_d(i, j) = f_\theta(x_i, x_j)$, and the across-document distances as
$F_0((i, d), (j, d')) = w(d, d') f_\theta(x_{i,d}, x_{j,d'})$, where
$w(d, d') = \exp(\gamma \, \mathrm{sim}(d, d'))$ captures document
similarity, $\mathrm{sim}(d, d')$ is the TF-based similarity between
documents $d$ and $d'$, and $\gamma$ is a weight parameter. Higher
$\gamma$ leads to a higher effect of document-level similarities on the
linking probabilities. We set $\gamma = 1$ in our experiments.
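The document-level weighting can be sketched as below. The example assumes each document is represented by the list of words in its event-related mentions; the function names and toy documents are hypothetical, while the 0.4 threshold and $\gamma = 1$ are the values reported in the text.

```python
import math
from collections import Counter

def tf_cosine(words_a, words_b):
    """Cosine similarity between term-frequency vectors of two word lists."""
    ta, tb = Counter(words_a), Counter(words_b)
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0

def cross_document_distance(f_theta, sim, gamma=1.0):
    """F_0 = w(d, d') * f_theta, with w(d, d') = exp(gamma * sim(d, d'))."""
    return math.exp(gamma * sim) * f_theta

def similar_document_pairs(docs, threshold=0.4):
    """Document pairs (d, e), d < e, whose TF-based similarity exceeds the threshold."""
    names = sorted(docs)
    return [(d, e) for k, d in enumerate(names) for e in names[k + 1:]
            if tf_cosine(docs[d], docs[e]) > threshold]

docs = {
    "doc1": ["bomb", "camp", "yida", "thursday"],
    "doc2": ["bomb", "camp", "yida", "school"],
    "doc3": ["election", "vote"],
}
pairs = similar_document_pairs(docs)
```

Since $w(d, d') \geq 1$ for non-negative similarities, the weighting never lowers a pairwise score; it only boosts links between mentions from similar documents.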


6 Experiments 


We conduct experiments using the ECB+ corpus (Cybulska and Vossen,
2014b), the largest available dataset with annotations of both
within-document (WD) and cross-document (CD) event coreference
resolution. It extends ECB 0.1 (Lee et al., 2012) and ECB (Bejan and
Harabagiu, 2010) by adding event argument and argument type annotations
as well as adding more news documents. The cross-document coreference
annotations only exist in documents that describe the same seminal event
(the event that triggers the topic of the document and has
interconnections with the majority of events from its surrounding textual
context (Bejan and Harabagiu, 2014)). We divide the dataset into a
training set (topics 1-20), a development set (topics 21-23), and a test
set (topics 24-43). Table 2 shows the statistics of the data.

We performed event coreference resolution on all possible event mentions
that are expressed in the documents. Using the event extraction method
described in Section 4, we extracted 53,429 event mentions, 43,682
participant mentions, 5,791 time mentions and 3,836 location mentions in
the test data, covering 93.5%, 89.0%, 95.0%, 72.8% of the annotated event
mentions, participants, time and locations, respectively.

We evaluate both within- and cross-document event coreference resolution.
As in previous work (Bejan and Harabagiu, 2010), we evaluate
cross-document coreference resolution by merging all documents from the
same seminal event into a meta-document and then evaluate the
meta-document as in within-document coreference resolution. However,
during inference time, we do not assume the knowledge of the mapping of
documents to seminal events.

We consider three widely used coreference resolution metrics: (1) MUC
(Vilain et al., 1995), which measures how many gold (predicted) cluster
merging operations are needed to recover each predicted (gold) cluster;
(2) B³ (Bagga and Baldwin, 1998), which measures the proportion of
overlap between the predicted and gold clusters for each mention and
computes the average scores; and (3) CEAF (Luo, 2005) (CEAFe), which
measures the best alignment of the gold-standard and predicted clusters.
We also consider the CoNLL F1, which is the average F1 of the above three
measures. All the scores are computed using the latest version (v8.01) of
the official CoNLL scorer (Pradhan et al., 2014).
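To make the per-mention averaging in B³ concrete, a simplified sketch follows. It assumes the gold and predicted clusterings cover exactly the same set of mentions, which the official scorer does not require; it is meant only to illustrate the definition and is not a substitute for the CoNLL scorer used in the experiments.

```python
def b_cubed(gold, pred):
    """B-cubed precision, recall and F1 for two clusterings of the same mentions."""
    gold_of = {m: frozenset(c) for c in gold for m in c}
    pred_of = {m: frozenset(c) for c in pred for m in c}
    mentions = list(gold_of)
    # Per-mention overlap between its gold cluster and its predicted cluster.
    precision = sum(len(gold_of[m] & pred_of[m]) / len(pred_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & pred_of[m]) / len(gold_of[m])
                 for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = b_cubed(gold=[[1, 2, 3], [4]], pred=[[1, 2], [3, 4]])
```

For this toy input the precision is 0.75 and the recall is 2/3, since mention 3 is split from its gold cluster and wrongly merged with mention 4.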


6.1 Baselines 

We compare our proposed HDDCRP model (HDD- 
CRP) to five baselines: 

• Eemma: a heurisfic mefhod fhaf groups all 
even! mentions, eifher wifhin or across docu- 
menfs, which have the same lemmatized head 
word. It is usually considered a strong baseline 
for event coreference resolution. 


Agglomerative: a supervised clustering 
method for within-document event corefer¬ 
ence dChen et ah, 2009l l. We extend it to 
within- and cross-document event coreference 
by performing single-link clustering in two 
phases: first grouping mentions within doc¬ 
uments and then grouping within-document 


clusters to larger clusters across documents. 
We compute the pairwise-linkage scores using 


the log-linear model described in Section 5.4 


HDP-LEX: an unsupervised Bayesian clus¬ 
tering model for within- and cross-document 
event coreference ([B^jan and Harabagiu, 


2010|p It is a hierarchical Dirichlet process 


(HDP) model with the likelihood of all the lem¬ 
matized words observed in the event mentions. 
In general, the HDP can be formulated using a 
two-level sequential CRP. Our HDDCRP model 
is a two-level DDCRP that generalizes the HDP 
to allow data dependencies to be incorporated 
at both level^ 


• DDCRP: a DDCRP model we develop for event
coreference resolution. It applies the distance
prior in Equation to all pairs of event mentions
in the corpus, ignoring the document
boundaries. It uses the same likelihood function
and the same log-linear model to learn
the distance values as HDDCRP. However, it has
fewer link variables than HDDCRP and it does
not distinguish between the within-document
and cross-document link variables. For the
same clustering structure, HDDCRP can generate
more possible link configurations than DDCRP.


• HDDCRP*: a variant of the proposed HDDCRP
that only incorporates the within-document dependencies
but not the cross-document dependencies.
The generative process of HDDCRP* is
similar to the one described in Section 5.2, except
that in step 2, for each table $t$, we sample
a cluster assignment $c_t$ according to

$$p(c_t = k) \propto \begin{cases} n_k, & k \le K \\ \alpha_0, & k = K + 1 \end{cases}$$

where $K$ is the number of existing clusters,
$n_k$ is the number of existing tables that belong
to cluster $k$, and $\alpha_0$ is the concentration parameter.
In step 3, the clusters z(a, c) are constructed
by traversing the customer links and
looking up the cluster assignments for the obtained
tables. We also use Gibbs sampling for
inference.

For HDP-LEX, we re-implemented the HDP-based models proposed by
Bejan and Harabagiu (2010): HDP_1f, HDP_flat (including HDP_flat (LF),
(LF+WF), and (LF+WF+SF)), and HDP_struct, and found that HDP_flat
with lexical features (LF) performs the best in our experiments.
We refer to it as HDP-LEX.

Note that HDP-LEX is not a special case of HDDCRP because
we define the table-level distance function as the distances
between customers instead of between tables. In our model, the
probability of linking a table t to another table s depends on
the distance between the head customer at table t and all other
customers who sit at table s. Defining the table-level distance
function this way allows us to derive a tractable inference algorithm
using Gibbs sampling.
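The cluster-assignment rule used in step 2 of HDDCRP* is a standard Chinese restaurant process draw: a table joins an existing cluster with weight equal to the number of tables already in it, or opens a new cluster with weight equal to the concentration parameter. The following is a minimal sketch of that prior only, assuming clusters are labelled 1..K and the concentration parameter is positive. It omits the likelihood term that a Gibbs sampler would multiply in, and the names `sample_cluster` and `assign_tables` are hypothetical.

```python
import random
from collections import Counter


def sample_cluster(assignments, alpha0, rng):
    """Draw a cluster id for one new table given earlier table assignments."""
    counts = Counter(assignments)       # n_k: tables already in cluster k
    existing = sorted(counts)           # existing cluster ids 1..K
    new_id = len(existing) + 1          # the "k = K + 1" case
    labels = existing + [new_id]
    weights = [counts[k] for k in existing] + [alpha0]
    return rng.choices(labels, weights=weights, k=1)[0]


def assign_tables(num_tables, alpha0, seed=0):
    """Sequentially assign tables to clusters under the CRP prior."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(num_tables):
        assignments.append(sample_cluster(assignments, alpha0, rng))
    return assignments


if __name__ == "__main__":
    # Larger alpha0 tends to open more clusters; smaller alpha0 fewer.
    print(assign_tables(10, alpha0=1.0, seed=0))
```

The first table always opens cluster 1, since the only available weight is the concentration parameter.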

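The two-phase single-link procedure described for the Agglomerative baseline can be sketched as follows. This is an illustration under stated assumptions rather than the baseline's actual implementation: merging stops at a fixed score threshold, the same threshold is used in both phases, and `pair_score` is a toy lookup standing in for the learned pairwise-linkage scores. All function names are hypothetical.

```python
from itertools import combinations


def single_link(items, score, threshold):
    """Single-link grouping: link any two items whose score >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


def two_phase(docs, pair_score, threshold):
    """Phase 1: cluster mentions per document. Phase 2: merge those clusters."""
    wd_clusters = []
    for mentions in docs.values():
        for cluster in single_link(mentions, pair_score, threshold):
            wd_clusters.append(tuple(cluster))

    def cluster_score(c1, c2):
        # Single link between clusters: the best-scoring mention pair.
        return max(pair_score(a, b) for a in c1 for b in c2)

    merged = single_link(wd_clusters, cluster_score, threshold)
    return [sorted(m for c in group for m in c) for group in merged]


# Toy scores (assumed values, for illustration only).
toy = {frozenset(("a1", "a2")): 0.9, frozenset(("a2", "b1")): 0.8,
       frozenset(("a1", "b1")): 0.2}
pair_score = lambda a, b: toy.get(frozenset((a, b)), 0.0)
docs = {"d1": ["a1", "a2"], "d2": ["b1", "b2"]}
result = two_phase(docs, pair_score, threshold=0.5)
print(sorted(result))
```

Because every merge is a hard decision, a wrong pairwise score in either phase cannot be undone later in this scheme.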

6.2 Parameter settings 

For all the Bayesian models, the reported results are
averaged over five MCMC runs, each of 500
iterations. We found that mixing happens before
500 iterations in all models by observing the joint
log-likelihood. For the DDCRP, HDDCRP* and HDDCRP,
we randomly initialized the link variables. Before
initialization, we assume that each mention belongs
to its own cluster. We assume mentions are
ordered according to their appearance within a document,
but we do not assume any particular ordering
of documents. We also truncated the pairwise mention
similarity to zero if it is below 0.5, as we found
that this leads to better performance on the development
set. We set $\alpha_1 = \dots = \alpha_n = 0.5$, $\alpha_0 = 0.001$
for HDDCRP, $\alpha_0 = 1$ for HDDCRP*, $\alpha = 0.1$ for DDCRP,
and A = 10“^. All the hyperparameters were
set based on the development data.
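Two of the settings above are simple enough to state in code: truncating pairwise similarities below 0.5 to zero, and averaging a reported score over several MCMC runs. The sketch below is only an illustration. The function names are hypothetical and the per-run scores are placeholder values, not results from the experiments.

```python
from statistics import mean


def truncate_similarity(sim: float, threshold: float = 0.5) -> float:
    """Set a pairwise mention similarity to zero when it falls below the threshold."""
    return sim if sim >= threshold else 0.0


def average_over_runs(run_scores):
    """Average one evaluation score across independent MCMC runs."""
    return mean(run_scores)


similarities = [0.12, 0.49, 0.50, 0.87]
truncated = [truncate_similarity(s) for s in similarities]
print(truncated)

placeholder_scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # five runs, toy values
print(average_over_runs(placeholder_scores))
```

Truncation makes the prior sparse: mention pairs with weak similarity contribute no linking mass at all.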


6.3 Main Results

Table 3 shows the event coreference results. We can see that Lemma-matching is a strong baseline for event coreference resolution. HDP-LEX provides noticeable improvements, suggesting the benefit of using an infinite mixture model for event clustering. Agglomerative further improves the performance over HDP-LEX for WD resolution; however, it fails to improve CD resolution. We conjecture that this is due to the combination of ineffective thresholding and the prediction errors on the pairwise distances between mention pairs across documents. Overall, HDDCRP* outperforms all the baselines in CoNLL F1 for both WD and CD evaluation. The clear performance gains over HDP-LEX demonstrate that it is important to account for pairwise mention dependencies in the generative modeling of event clustering. The improvements over Agglomerative indicate that it is more effective to model mention-pair dependencies as clustering priors than as heuristics for deterministic clustering.

Comparing among the HDDCRP-related models, we can see that HDDCRP clearly outperforms DDCRP, demonstrating the benefits of incorporating the hierarchy into the model. HDDCRP also performs better than HDDCRP* in WD CoNLL F1, indicating that incorporating cross-document information helps within-document clustering. We can also see that HDDCRP performs similarly to HDDCRP* in CD CoNLL F1 due to the lower $B^3$ F1, in particular, the decrease in $B^3$ recall. This is because applying the DDCRP prior at both within- and cross-document levels results in more conservative clustering and produces smaller clusters. This could potentially be improved by employing more accurate similarity priors.

To further understand the effect of modeling mention-pair dependencies, we analyze the impact of the features in the mention-pair similarity model. Table 4 lists the learned weights of some top features (sorted by weights). We can see that they mainly serve to discriminate event mentions based on the head word similarity (especially embedding-based similarity) and the context word similarity. Event argument information such as SRL Arg1, SRL Arg0, and Participant is also indicative of the coreferential relations.

                       MUC                  B^3                 CEAFe         CoNLL
                  P     R     F1       P     R     F1       P     R     F1      F1

Cross-document Event Coreference Resolution (CD)
Lemma            75.1  55.4  63.8     71.7  39.6  51.0     36.2  61.1  45.5    53.4
HDP-LEX          75.5  63.5  69.0     65.6  43.7  52.5     34.8  60.2  44.1    55.2
Agglomerative    78.3  59.2  67.4     73.2  40.2  51.9     30.2  65.6  41.4    53.6
DDCRP            79.6  58.2  67.1     78.1  39.6  52.6     31.8  69.4  43.6    54.4
HDDCRP*          77.5  66.4  71.5     69.0  48.1  56.7     38.2  63.0  47.6    58.6
HDDCRP           80.3  67.1  73.1     78.5  40.6  53.5     38.6  68.9  49.5    58.7

Within-document Event Coreference Resolution (WD)
Lemma            60.9  30.2  40.4     78.9  57.3  66.4     63.6  69.0  66.2    57.7
HDP-LEX          50.0  39.1  43.9     74.7  67.6  71.0     66.2  71.4  68.7    61.2
Agglomerative    61.9  39.2  48.0     80.7  67.6  73.5     65.6  76.0  70.4    63.9
DDCRP            71.2  36.4  48.2     85.4  64.9  73.8     61.8  76.1  68.2    63.4
HDDCRP*          58.1  42.8  49.3     78.4  68.7  73.2     67.6  74.5  70.9    64.5
HDDCRP           74.3  41.7  53.4     85.6  67.3  75.4     65.1  79.8  71.7    66.8

Table 3: Within- and cross-document coreference results on the ECB+ corpus


6.4 Discussion

We found that HDDCRP corrects many errors made by the traditional agglomerative clustering model (Agglomerative) and the unsupervised generative model (HDP-LEX). Agglomerative easily suffers from error propagation as the errors made by the supervised distance learner cannot be corrected. HDP-LEX often mistakenly groups mentions together based on word co-occurrence statistics but not the apparent similarity features in the mentions. In contrast, HDDCRP avoids such errors by performing probabilistic modeling of clustering and making use of rich linguistic features trained on available annotated data. For example, HDDCRP correctly groups the event mention “unveiled” in “Apple’s Phil Schiller unveiled a revamped MacBook Pro today” together with the event mention “announced” in “this notebook isn’t the only laptop Apple announced for the MacBook Pro lineup today”, while both HDP-LEX and Agglomerative models fail to make such a connection.

By looking further into the errors, we found that a lot of mistakes made by HDDCRP are due to the errors in event extraction and pairwise linkage prediction. The event extraction errors include false positive and false negative event mentions and event arguments, boundary errors for the extracted mentions, and argument association errors. The pairwise linking errors often come from the lack of semantic and world knowledge, and this applies to both event mentions and event arguments, especially for time and location arguments, which are less likely to be repeatedly mentioned and in many cases require external knowledge to resolve their meanings, e.g., “May 3, 2013” is “Friday” and “Mount Cook” is “New Zealand’s highest peak”.

Features               Weight
Head Embedding sim     4.5
String match           2.77
Context sim            1.75
Synonym sim            1.56
TF sim                 1.17
SRL Arg1 sim           1.10
SRL Arg0 sim           0.89
Participant sim        0.68

Table 4: Learned weights for selected features


7 Conclusion

In this paper we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages the advantages of generative modeling of coreference resolution and feature-rich discriminative modeling of mention reference relations. We have shown its power in resolving event coreference by comparing it to a traditional agglomerative clustering approach and a state-of-the-art unsupervised generative clustering approach. It is worth noting that our model is general and can be easily applied to other clustering problems involving feature-rich objects and cluster sharing across data groups. While the model can effectively cluster objects of a single type, it would be interesting to extend it to allow joint clustering of objects of different types, e.g., events and entities.